Purpose

This markdown document lists the steps we followed to refine the models.

Notes

Data

Response variables

  • Response variables to consider
    • Number of unhealthy trees
    • Tree canopy symptoms

Below we compare models using canopy symptoms as the response variable.

Explanatory variables

  • Explanatory variables included
    • iNat data
      • explanatory variables such as tree size
    • Climate data
      • 30yr normals 1991-2020
      • last decade 2011-2020
    • Soils data
      • gSSURGO
        • muaggatt
        • component

Let's add summaries of how many variables each of these datasets provided.

Note there may be a third soils dataset to incorporate. We also need to confirm that the normals data are in fact the latest available normals.

The data used in the below models are described in the Data Wrangle folder.

Data wrangling

There are multiple ways to group the response variables, depending on the desired resolution or fineness of the model.

For now, we can move forward with the binary response grouping because it is the broadest and the easiest for the model to classify.

Filter Data

Filter trees to only those with soils data (Oregon and Washington)
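A minimal sketch of this filter, assuming the combined data frame is called `trees` and the soils join key is a hypothetical `mukey` column (both names are assumptions, not the project's actual ones):

```r
library(dplyr)

# Keep only trees that matched a gSSURGO map unit (Oregon/Washington coverage);
# `trees` and `mukey` are assumed names.
trees.soils <- trees %>%
  filter(!is.na(mukey))
```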

All tree health categories

## # A tibble: 11 × 2
## # Groups:   field.tree.canopy.symptoms [11]
##    field.tree.canopy.symptoms                             n
##    <fct>                                              <int>
##  1 Branch Dieback or 'Flagging'                          19
##  2 Browning Canopy                                       19
##  3 Extra Cone Crop                                        2
##  4 Healthy                                              403
##  5 Multiple Symptoms (please list in Notes)              17
##  6 New Dead Top (red or brown needles still attached)    33
##  7 Old Dead Top (needles already gone)                   83
##  8 Other (please describe in Notes)                       8
##  9 Thinning Canopy                                      118
## 10 Tree is dead                                          37
## 11 Yellowing Canopy                                      10

We also need to filter the data to only include response and explanatory variables we’re interested in. For example, whether a sound clip was included in the iNat data is not important.

We also need to remove the other response variables, like “field.percent.canopy.affected….”, so they are not used as predictors of tree health.
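One way to sketch both drops with `dplyr::select` helpers; all column-name patterns here are illustrative, not the actual iNat export names:

```r
library(dplyr)

# Drop iNat bookkeeping fields (e.g., sound-clip columns) and the alternate
# response variables so they cannot leak into the predictors.
# Column-name patterns are assumptions.
trees.soils <- trees.soils %>%
  select(-starts_with("sound"),
         -starts_with("field.percent.canopy.affected"))
```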

Note it might be interesting to know whether the observer (iNat user) is an important factor in predicting whether a tree is healthy or unhealthy.

There are also a number of factors that should probably be removed because they may bias the data. For example, the ‘other factor’ question may only be answered for unhealthy trees. We need to think about this a bit more.

Remove variables that have near-zero standard deviations (i.e., the entire column is the same value).
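A base-R sketch of the constant-column check, assuming the working data frame is named `trees.soils`:

```r
# Flag columns where every non-missing value is identical, then drop them.
constant <- vapply(trees.soils,
                   function(x) length(unique(x[!is.na(x)])) <= 1,
                   logical(1))
trees.soils <- trees.soils[, !constant]
```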

Impute data

Overall Imputed data

We continue to get the below error, but were able to work around it by imputing the data.

Error in randomForest.default(m, y, …) : Need at least two classes to do classification.

To impute the data we have to remove factors with more than 53 levels (randomForest cannot handle categorical predictors with more than 53 categories).

The below code lists the number of levels for the variables that are factors.
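A sketch of that check (the data frame name `trees.soils` is an assumption):

```r
# Count the levels of every factor column; randomForest cannot handle
# categorical predictors with more than 53 categories.
factor.levels <- vapply(Filter(is.factor, trees.soils), nlevels, integer(1))
sort(factor.levels[factor.levels > 53], decreasing = TRUE)
```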

  • The following factors had more than 53 levels
    • “muaggatt_musym”
    • “muaggatt_muname”
    • “component_compname”
    • “component_geomdesc”
    • “component_taxclname”
    • “component_taxsubgrp”
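The workaround can be sketched as below. The data frame names and the use of `randomForest::rfImpute` are assumptions, though the per-iteration output that follows is consistent with `rfImpute`'s default `ntree = 300`:

```r
library(randomForest)

# Drop the six factors with > 53 levels, then impute the remaining NAs
# with proximity-based random forest imputation.
high.level <- c("muaggatt_musym", "muaggatt_muname", "component_compname",
                "component_geomdesc", "component_taxclname",
                "component_taxsubgrp")
trees.imputed <- rfImpute(
  field.tree.canopy.symptoms ~ .,
  data = trees.soils[, !(names(trees.soils) %in% high.level)])
```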

Imputed data table

## ntree      OOB      1      2      3      4      5      6      7      8      9     10     11
##   300:  46.06% 94.74% 94.74% 100.00% 16.87% 76.47% 87.88% 71.08% 100.00% 72.88% 91.89% 100.00%
## ntree      OOB      1      2      3      4      5      6      7      8      9     10     11
##   300:  45.79% 94.74% 94.74% 100.00% 15.63% 82.35% 87.88% 72.29% 100.00% 72.03% 97.30% 100.00%
## ntree      OOB      1      2      3      4      5      6      7      8      9     10     11
##   300:  45.39% 94.74% 100.00% 100.00% 14.39% 76.47% 87.88% 74.70% 100.00% 72.88% 94.59% 100.00%
## ntree      OOB      1      2      3      4      5      6      7      8      9     10     11
##   300:  46.73% 94.74% 94.74% 100.00% 16.38% 76.47% 87.88% 75.90% 100.00% 74.58% 94.59% 100.00%
## ntree      OOB      1      2      3      4      5      6      7      8      9     10     11
##   300:  45.93% 100.00% 94.74% 100.00% 15.14% 82.35% 87.88% 75.90% 100.00% 72.03% 94.59% 100.00%
## ntree      OOB      1      2      3      4      5      6      7      8      9     10     11
##   300:  46.06% 94.74% 94.74% 100.00% 16.13% 76.47% 87.88% 73.49% 100.00% 72.03% 97.30% 100.00%

Group response variables to binary values

Binary tree health categories

## # A tibble: 2 × 2
## # Groups:   field.tree.canopy.symptoms [2]
##   field.tree.canopy.symptoms     n
##   <fct>                      <int>
## 1 Healthy                      403
## 2 Unhealthy                    346
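A sketch of the grouping, assuming the imputed data frame is named `trees.imputed`; every non-"Healthy" symptom (including "Tree is dead") collapses into "Unhealthy", which matches the 403/346 split above:

```r
library(dplyr)

# Collapse the 11 canopy-symptom levels to a binary factor.
binary <- trees.imputed %>%
  mutate(field.tree.canopy.symptoms = factor(
    ifelse(field.tree.canopy.symptoms == "Healthy", "Healthy", "Unhealthy")))
```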

Approach

Models

Comparing full and reduced datasets of explanatory variables

  • Datasets compared below
    • Full model = all climate variables retained
    • Monthless model = per-month climate parameters filtered out
    • Normal-only, monthless model = decadal data also removed
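The two reduced datasets can be sketched with name-pattern drops; the month and decade tags in the column names are assumptions to be adapted to the real naming scheme:

```r
library(dplyr)

# Monthless: drop per-month climate columns (assumed "_jan" ... "_dec" tags).
monthless.binary <- binary %>%
  select(-matches("_(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)"))

# Normal-only, monthless: additionally drop the 2011-2020 decadal columns
# (assumed to carry a "decade" tag in their names).
normal.monthless.binary <- monthless.binary %>%
  select(-matches("decade"))
```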

Full Model

## 
## Call:
##  randomForest(formula = field.tree.canopy.symptoms ~ ., data = binary,      ntree = 2001, importance = TRUE, proximity = TRUE, na.action = na.omit) 
##                Type of random forest: classification
##                      Number of trees: 2001
## No. of variables tried at each split: 23
## 
##         OOB estimate of  error rate: 26.44%
## Confusion matrix:
##           Healthy Unhealthy class.error
## Healthy       303       100    0.248139
## Unhealthy      98       248    0.283237

Monthless Model

## 
## Call:
##  randomForest(formula = field.tree.canopy.symptoms ~ ., data = monthless.binary,      ntree = 2001, importance = TRUE, proximity = TRUE, na.action = na.omit) 
##                Type of random forest: classification
##                      Number of trees: 2001
## No. of variables tried at each split: 15
## 
##         OOB estimate of  error rate: 28.7%
## Confusion matrix:
##           Healthy Unhealthy class.error
## Healthy       293       110   0.2729529
## Unhealthy     105       241   0.3034682

Normal, Monthless Model

## 
## Call:
##  randomForest(formula = field.tree.canopy.symptoms ~ ., data = normal.monthless.binary,      ntree = 2001, importance = TRUE, proximity = TRUE, na.action = na.omit) 
##                Type of random forest: classification
##                      Number of trees: 2001
## No. of variables tried at each split: 13
## 
##         OOB estimate of  error rate: 29.51%
## Confusion matrix:
##           Healthy Unhealthy class.error
## Healthy       284       119   0.2952854
## Unhealthy     102       244   0.2947977

Comparing grouping variables